In this paper, we aim to design an efficient real-time object detector that exceeds the YOLO series and is easily extensible for many object recognition tasks such as instance segmentation and rotated object detection. To obtain a more efficient model architecture, we explore an architecture that has compatible capacities in the backbone and neck, constructed by a basic building block that consists of large-kernel depth-wise convolutions. We further introduce soft labels when calculating matching costs in the dynamic label assignment to improve accuracy. Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU, outperforming the current mainstream industrial detectors. RTMDet achieves the best parameter-accuracy trade-off with tiny/small/medium/large/extra-large model sizes for various application scenarios, and obtains new state-of-the-art performance on real-time instance segmentation and rotated object detection. We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks. Code and models are released at https://github.com/open-mmlab/mmdetection/tree/3.x/configs/rtmdet.
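The large-kernel depth-wise convolution at the core of RTMDet's building block can be illustrated with a minimal NumPy sketch (the actual block also combines this with point-wise convolutions, normalization, and activations, which are omitted here):

```python
import numpy as np

def depthwise_conv2d(x, kernels, padding):
    """Depth-wise 2D convolution: each channel is filtered independently.

    x:       (C, H, W) input feature map
    kernels: (C, k, k) one k x k filter per channel
    """
    c, h, w = x.shape
    _, k, _ = kernels.shape
    xp = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(xp[ch, i:i + k, j:j + k] * kernels[ch])
    return out

# A 5x5 depth-wise kernel (large relative to the usual 3x3) enlarges the
# effective receptive field at a small parameter cost: C*5*5 weights
# instead of the C*C*5*5 of a dense convolution.
x = np.random.randn(8, 16, 16)
k = np.random.randn(8, 5, 5)
y = depthwise_conv2d(x, k, padding=2)
print(y.shape)  # (8, 16, 16)
```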
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from images simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a $1$D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves $91.04\%$ and $83.51\%$ in terms of recall rate on RefCOCO testA and testB, respectively. Code will be available at \url{https://github.com/IDEA-Research/DQ-DETR}.
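The dual-query design above can be sketched as a toy NumPy illustration. The assumption that a query is the element-wise sum of its positional and content parts is mine for illustration; the actual model learns these parts as embeddings inside a Transformer decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs, d_model = 4, 16

# One shared positional part anchors each dual-query pair to the same
# region, while two distinct content parts probe different modalities.
pos_part = rng.normal(size=(num_pairs, d_model))
content_obj = rng.normal(size=(num_pairs, d_model))     # probes image features
content_phrase = rng.normal(size=(num_pairs, d_model))  # probes text features

obj_queries = pos_part + content_obj        # drive object (box) prediction
phrase_queries = pos_part + content_phrase  # drive phrase-mask prediction

# The shared positional part ties the two predictions of a pair together:
# stripping the content parts recovers the same anchor for both.
assert np.allclose(obj_queries - content_obj, phrase_queries - content_phrase)
```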
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
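The mask branch described above — query embeddings dot-producted with a high-resolution pixel embedding map — reduces to a few lines. This sketch assumes a zero threshold on the logits and omits the upsampling that produces the pixel map:

```python
import numpy as np

def predict_masks(query_embed, pixel_embed, threshold=0.0):
    """Dot-product mask prediction.

    query_embed: (Q, D) per-query embeddings from the decoder
    pixel_embed: (D, H, W) high-resolution pixel embedding map
    returns:     (Q, H, W) boolean masks, one per query
    """
    d, h, w = pixel_embed.shape
    logits = query_embed @ pixel_embed.reshape(d, h * w)  # (Q, H*W)
    return logits.reshape(-1, h, w) > threshold

masks = predict_masks(np.random.randn(10, 32), np.random.randn(32, 64, 64))
print(masks.shape)  # (10, 64, 64)
```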
We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. DINO achieves $49.4$ AP in $12$ epochs and $51.3$ AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of $\textbf{+6.0}$ \textbf{AP} and $\textbf{+2.7}$ \textbf{AP}, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$ \textbf{AP}) and \texttt{test-dev} ($\textbf{63.3}$ \textbf{AP}). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/IDEACVR/DINO}.
We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds ground-truth bounding boxes with noises into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence. Our method is universal and can easily be plugged into any DETR-like method by adding dozens of lines of code. As a result, our DN-DETR achieves a remarkable improvement ($+1.9$ AP) under the same setting and the best result (AP $43.4$ and $48.6$ with $12$ and $50$ epochs of training, respectively) among DETR-like methods with a ResNet-$50$ backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with $50\%$ of the training epochs. Code is available at \url{https://github.com/FengLi-ust/DN-DETR}.
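The box-noising step can be sketched as below (center shift proportional to box size plus multiplicative width/height jitter; DN-DETR's exact noise scales and its label-flipping noise are omitted, and `noise_scale` is an illustrative parameter):

```python
import numpy as np

def noise_boxes(gt_boxes, noise_scale=0.4, rng=None):
    """Jitter ground-truth (cx, cy, w, h) boxes. During denoising training
    the decoder receives these noisy boxes as extra queries and is trained
    to reconstruct the originals, bypassing bipartite matching for this
    branch."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(-1.0, 1.0, size=gt_boxes.shape) * noise_scale
    cx, cy, w, h = gt_boxes.T
    return np.stack([cx + u[:, 0] * w / 2,   # shift center within the box
                     cy + u[:, 1] * h / 2,
                     w * (1.0 + u[:, 2]),    # rescale width and height
                     h * (1.0 + u[:, 3])], axis=1)

boxes = np.array([[0.5, 0.5, 0.2, 0.3]])
print(noise_boxes(boxes, noise_scale=0.0))  # unchanged: [[0.5 0.5 0.2 0.3]]
```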
In this study, we dive deep into the unique challenges of semi-supervised object detection (SSOD). We observe that current detectors generally suffer from three inconsistency problems: 1) assignment inconsistency, where the conventional assignment strategy is sensitive to labeling noise; 2) subtask inconsistency, where classification and regression predictions are misaligned at the same feature point; and 3) temporal inconsistency, where pseudo bboxes vary dramatically across training steps. These issues lead to inconsistent optimization objectives for the student network, which deteriorate performance and slow down model convergence. We therefore propose a systematic solution, termed Consistent Teacher, to remedy the above challenges. First, adaptive anchor assignment substitutes the static IoU-based strategy, which enables the student network to be resistant to noisy pseudo bboxes. Then we calibrate the subtask predictions by designing a feature alignment module. Finally, we adopt a Gaussian Mixture Model (GMM) to dynamically adjust the pseudo-box threshold. Consistent Teacher provides a strong new baseline on a range of SSOD evaluations. It achieves 40.0 mAP with a ResNet-50 backbone given only 10% annotated MS-COCO data, surpassing previous baselines that use only pseudo labels by 4 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, the performance further increases to 49.1 mAP. Our code will be open-sourced soon.
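The GMM-based dynamic threshold can be sketched with a small 1-D EM fit. This is a generic sketch, not the paper's exact policy: the two-component choice, percentile initialization, and turning the posterior crossing into a threshold are assumptions here.

```python
import numpy as np

def gmm_threshold(scores, iters=100):
    """Fit a 2-component 1-D Gaussian mixture to pseudo-box confidence
    scores via EM and return the score at which the posterior of the
    high-score ("reliable") component reaches 0.5, used as a dynamic
    filtering threshold."""
    s = np.asarray(scores, dtype=float)
    mu = np.array([np.percentile(s, 25), np.percentile(s, 75)])
    var = np.full(2, s.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each score
        dens = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -(s[:, None] - mu) ** 2 / (2 * var))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = resp.sum(axis=0)
        pi = nk / len(s)
        mu = (resp * s[:, None]).sum(axis=0) / nk
        var = (resp * (s[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    hi = int(np.argmax(mu))
    # scan between the two means for the 0.5-posterior crossing
    grid = np.linspace(mu.min(), mu.max(), 512)
    dens = pi / np.sqrt(2 * np.pi * var) * np.exp(
        -(grid[:, None] - mu) ** 2 / (2 * var))
    post_hi = dens[:, hi] / dens.sum(axis=1)
    return float(grid[np.argmin(np.abs(post_hi - 0.5))])
```

With clearly bimodal scores (a cluster of unreliable boxes near 0.2 and a reliable cluster near 0.8), the returned threshold falls between the two modes and adapts automatically as the score distribution shifts during training.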
It is well known that deep learning models are vulnerable to adversarial examples. Existing research on adversarial training has made significant progress on this challenge, but it typically assumes that the class distribution is roughly balanced. However, long-tailed datasets, in which head classes have far more instances than tail classes, are ubiquitous in a wide range of applications. In this case, AUC is a more reasonable metric than accuracy, since it is insensitive to the class distribution. Motivated by this, we present an early trial to explore adversarial training methods that optimize AUC. The main challenge lies in the fact that positive and negative examples are tightly coupled in the objective function; as a direct consequence, adversarial examples cannot be generated without a full scan of the dataset. To address this issue, based on a concavity regularization scheme, we reformulate the AUC optimization problem as a saddle-point problem whose objective becomes an instance-wise function. This leads to an end-to-end training scheme. Furthermore, we provide a convergence guarantee for the proposed algorithm. Our analysis differs from existing studies since the algorithm is asked to generate adversarial examples by calculating the gradient of a min-max problem. Finally, extensive experimental results show the performance and robustness of our algorithm on three long-tailed datasets.
The Area Under the ROC Curve (AUC) is a crucial metric in machine learning, which evaluates average performance over all possible True Positive Rates (TPRs) and False Positive Rates (FPRs). Based on the knowledge that a skillful classifier should simultaneously embrace a high TPR and a low FPR, we turn to study a more general variant called Two-way Partial AUC (TPAUC), where only the region with $\mathsf{TPR} \ge \alpha, \mathsf{FPR} \le \beta$ is included. Moreover, recent work shows that TPAUC is essentially inconsistent with existing partial AUC metrics, in which only the FPR range is restricted, opening a new problem: how to find solutions that achieve a high TPAUC. Motivated by this, we present in this paper the first trial to optimize this new metric. The critical challenge along this course lies in the difficulty of performing gradient-based optimization with end-to-end stochastic training, even with a proper choice of surrogate loss. To address this issue, we propose a generic framework to construct surrogate optimization problems that support efficient end-to-end training with deep learning. Moreover, our theoretical analyses show that: 1) the objective function of the surrogate problems will achieve an upper bound of the original problem under mild conditions, and 2) optimizing the surrogate problems leads to good generalization performance in terms of TPAUC with high probability. Finally, empirical studies over several benchmark datasets speak to the efficacy of our framework.
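The metric restricted to $\mathsf{TPR} \ge \alpha, \mathsf{FPR} \le \beta$ admits a simple empirical estimator over the hardest examples: only the lowest-scored $(1-\alpha)$ fraction of positives and the highest-scored $\beta$ fraction of negatives fall in that region. A sketch follows; tie handling and normalization conventions may differ from the paper's formal definition:

```python
import numpy as np

def tpauc(labels, scores, alpha=0.5, beta=0.5):
    """Empirical two-way partial AUC over the region TPR >= alpha,
    FPR <= beta: the fraction of correctly ranked pairs among the hardest
    positives (lowest scores) and hardest negatives (highest scores)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = np.sort(scores[labels == 1])
    neg = np.sort(scores[labels == 0])
    n_pos = max(1, int(np.ceil((1 - alpha) * len(pos))))
    n_neg = max(1, int(np.ceil(beta * len(neg))))
    hard_pos = pos[:n_pos]    # positives scored below the alpha-quantile
    hard_neg = neg[-n_neg:]   # negatives scored above the (1-beta)-quantile
    return float((hard_pos[:, None] > hard_neg[None, :]).mean())

y = [1, 1, 1, 1, 0, 0, 0, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
print(tpauc(y, s))  # 1.0: even the hardest pairs are correctly ranked
```

Unlike full AUC, a classifier that ranks only its easy pairs correctly scores poorly here, which is exactly the behaviour the abstract motivates.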
The recently proposed Collaborative Metric Learning (CML) paradigm has aroused wide interest in the area of recommender systems (RS) owing to its simplicity and effectiveness. Typically, the existing literature on CML depends largely on the \textit{negative sampling} strategy to alleviate the time-consuming burden of pairwise computation. However, in this work, through theoretical analysis, we find that negative sampling leads to a biased estimation of the generalization error. Specifically, we show that sampling-based CML introduces a bias term into the generalization bound, quantified by the per-user \textit{Total Variation} (TV) between the distribution induced by negative sampling and the ground-truth distribution. This suggests that optimizing the sampling-based CML loss function does not ensure a small generalization error, even with a sufficiently large amount of training data. Moreover, we show that the bias term vanishes without the negative sampling strategy. Motivated by this, we propose an efficient alternative to CML without negative sampling, named \textit{Sampling-Free Collaborative Metric Learning} (SFCML), to get rid of the sampling bias in a practical sense. Finally, comprehensive experiments over seven benchmark datasets speak to the superiority of the proposed algorithm.
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT exhibits strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
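The implicit alignment idea, attaching a positional encoding derived from shared 3D coordinates to tokens of both modalities so the Transformer can relate them without an explicit view transform, can be sketched generically. The sinusoidal form and `d_model` split below are illustrative assumptions, not CMT's actual coordinate encoding:

```python
import numpy as np

def encode_3d(points, d_model=24):
    """Sinusoidal encoding of (x, y, z) coordinates: d_model // 6 frequency
    bands, one (sin, cos) pair per band per axis, giving (N, d_model)
    embeddings that can be added to image or point-cloud tokens."""
    bands = d_model // 6
    freqs = 2.0 ** np.arange(bands)                  # (bands,)
    ang = points[:, :, None] * freqs                 # (N, 3, bands)
    enc = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)  # (N, 3, 2*bands)
    return enc.reshape(len(points), -1)              # (N, d_model)

tokens = np.random.randn(5, 24)      # modality tokens (image or LiDAR)
coords = np.random.rand(5, 3)        # their associated 3D points
tokens = tokens + encode_3d(coords)  # both modalities share one 3D frame
print(tokens.shape)  # (5, 24)
```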